
ENH: add validate argument to merge #16275


Merged
merged 2 commits into pandas-dev:master on May 22, 2017

Conversation

nickeubank
Contributor

Open question: this does not currently check whether the "many" side of a merge is in fact many. I can't think of a case where I'm doing a one-to-many merge and would be worried if it was in fact one-to-one, but I'm open to alternate opinions.

right_w_dups = right.append(pd.DataFrame({'a': ['e'], 'c': ['moo']}, index=[4]))
merge(left, right_w_dups, left_index=True, right_index=True, validate='one_to_many')
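For illustration, a minimal self-contained sketch of that open question (the frames and key column here are made up, not from the PR): with the behaviour as implemented, validate='one_to_many' only checks the left keys, so it passes even when the merge happens to be one-to-one.

import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'key': [1, 2, 3], 'b': ['p', 'q', 'r']})  # right keys also unique

# Passes: 'one_to_many' only requires the left merge keys to be unique,
# so no error is raised even though this is in fact a one-to-one merge.
pd.merge(left, right, on='key', validate='one_to_many')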

def f():
Contributor

use the with syntax

with pytest.raises(..):
...
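For example, the wrapper-function pattern above could be rewritten roughly as follows (the frames and the expected error type are assumed for illustration):

import pytest
from pandas import DataFrame, merge

def test_validate_one_to_one_raises():
    left = DataFrame({'a': ['a', 'b'], 'b': [0, 1]})
    right = DataFrame({'a': ['a', 'a'], 'c': [0, 1]})  # duplicate key on the right

    # context-manager form instead of defining f() and passing it to pytest.raises
    with pytest.raises(ValueError):
        merge(left, right, on='a', validate='one_to_one')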

.. _whatsnew_0210.enhancements.merge_validate:

``merge`` supports ``validate`` argument to test merge
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor

this is a bit verbose



if self.validate == "one_to_one" or self.validate == "1:1":
if not left_unique and not right_unique:
Contributor

use in [..., ...]
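i.e. roughly (a sketch of the suggestion, reusing the left_unique / right_unique flags from the PR):

if self.validate in ["one_to_one", "1:1"]:
    if not left_unique and not right_unique:
        raise ValueError("Merge keys are not unique in either left "
                         "or right dataset; not a one-to-one merge")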

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 7, 2017
@nickeubank nickeubank force-pushed the check_merge branch 3 times, most recently from 8874c7d to e5a6ee1 on May 7, 2017 18:58
@codecov

codecov bot commented May 7, 2017

Codecov Report

Merging #16275 into master will decrease coverage by 0.01%.
The diff coverage is 92.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16275      +/-   ##
==========================================
- Coverage   90.34%   90.33%   -0.02%     
==========================================
  Files         167      167              
  Lines       50908    50935      +27     
==========================================
+ Hits        45994    46011      +17     
- Misses       4914     4924      +10
Flag Coverage Δ
#multiple 88.12% <92.59%> (ø) ⬆️
#single 40.27% <3.7%> (-0.13%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.59% <ø> (-0.1%) ⬇️
pandas/core/reshape/merge.py 93.89% <92.59%> (-0.06%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea56550...e5a6ee1. Read the comment docs.

@codecov

codecov bot commented May 7, 2017

Codecov Report

Merging #16275 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16275      +/-   ##
==========================================
+ Coverage   90.38%   90.41%   +0.03%     
==========================================
  Files         161      161              
  Lines       50999    51024      +25     
==========================================
+ Hits        46096    46134      +38     
+ Misses       4903     4890      -13
Flag Coverage Δ
#multiple 88.25% <100%> (+0.03%) ⬆️
#single 40.17% <4%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.69% <ø> (ø) ⬆️
pandas/core/reshape/merge.py 94.19% <100%> (+0.24%) ⬆️
pandas/core/indexes/datetimes.py 95.33% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.05% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6fcec6...3488f00. Read the comment docs.

@nickeubank
Contributor Author

@jreback I'm not sure the code in the PR checklist (git diff upstream/master --name-only -- '*.py' | flake8 --diff) works right. I don't get any errors with that, but clearly have flake8 errors (as evident from Travis tests).

@jreback
Contributor

jreback commented May 7, 2017

@nickeubank
Contributor Author

@jreback thanks -- but on my system I got no errors even when I had trailing whitespace and excessive empty lines. I think the command should be git diff master -- '*.py' | flake8 --diff, no?

@jreback
Contributor

jreback commented May 7, 2017

If you read the doc, it explains what upstream/master is.

@nickeubank
Contributor Author

I'm fine with upstream/master (I was copying off the doc, which just has master). I think the problem is the --name-only option. So it should be git diff upstream/master -- '*.py' | flake8 --diff.

@jreback
Contributor

jreback commented May 7, 2017

This command will catch any stylistic errors in your changes specifically, but beware it may not catch all of them. For example, if you delete the only usage of an imported function, it is stylistically incorrect to import an unused function. However, style-checking the diff will not catch this because the actual import is not part of the diff. Thus, for completeness, you should run this command, though it will take longer:

git diff master --name-only -- '*.py' | grep 'pandas/' | xargs -r flake8

@nickeubank
Contributor Author

It's not even catching local stylistic issues. It may be an argument-ordering problem.

[screenshot of terminal output, 2017-05-07 1:32 pm]

@jreback
Contributor

jreback commented May 7, 2017

I have this as an alias and it works like a charm: git diff master --name-only -- '*.py' | grep 'pandas/' | xargs -r flake8

yeah arg ordering does sometimes matter.

@nickeubank
Contributor Author

So maybe we need to change the checkbox command. Should I open an issue?

@jreback
Contributor

jreback commented May 7, 2017

you can just do a PR (also make sure the contributing docs are consistent).

@@ -174,6 +174,17 @@

.. versionadded:: 0.17.0

validate: None or string, default None
If specified, checks to ensure merge is of specified type.
Contributor

I think you need - or * on these? @jorisvandenbossche ?

validate: None or string, default None
If specified, checks to ensure merge is of specified type.
If "one_to_one" or "1:1", checks merge keys are unique in both
left and right dataset.
Contributor

checks if merge keys (and same on all)

@@ -952,6 +959,51 @@ def _validate_specification(self):
if len(self.right_on) != len(self.left_on):
raise ValueError("len(right_on) must equal len(left_on)")

def _validate(self):
# Get merging series:
Contributor

simply use self.left_join_keys and self.right_join_keys which are already lists of arrays (1 or more). No need to repeat this logic.

Contributor Author

I tried that, but if I want to use duplicated(), it ends up taking just as much additional code (or more) to work with self.right_join_keys or self.left_join_keys -- I have to coerce them into DataFrames in a way that's amenable to their variable potential dimensions, then run duplicated(). I don't think that's actually any better.

Contributor

show what you are doing

Contributor Author

The alternative? The shortest safe version I have is:

pd.concat(list(map(pd.Series, left_keys)), axis='columns')

@@ -561,6 +565,9 @@ def __init__(self, left, right, how='inner', on=None,
# to avoid incompat dtypes
self._maybe_coerce_merge_keys()

if self.validate is not None:
Contributor

add a comment here on what is happening, pass in validate and don't make it an instance variable. This will just raise if there is an error.
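A sketch of what that might look like in __init__ (the exact placement is assumed):

# If a validate specification was passed, check the uniqueness of the
# merge keys now; this simply raises if the check fails and has no
# effect on the result otherwise.
if validate is not None:
    self._validate(validate)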

@@ -724,6 +724,63 @@ def test_indicator(self):
how='outer', indicator=True)
assert_frame_equal(test5, hand_coded_result)

def test_validation(self):
left = DataFrame({'a': ['a', 'b', 'c', 'd'],
Contributor

add tests for merge_asof; this applies to it as well (different file).

Contributor Author

added


The ``validate`` argument for :func:`merge` function now checks whether a merge is
one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not
be an example of specified merge type, an exception will be raised.
Contributor

issue number at the end. You can just add this to Other Enhancements, doesn't need a sub-section (unless you want to add a full-on example).

@jreback
Contributor

jreback commented May 7, 2017

can you add a small section in the merge docs (and link from the doc-string to it)? Also an example in the doc-string would be good. Finally, there is an existing warning in the docs somewhat related to this (checking for duplicates); can you revise it?

may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user's responsibility to manage duplicate values in keys before joining large DataFrames.
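For context, a small made-up illustration of the row multiplication described in that warning:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [3, 4]})

# Both sides contain the key 'a' twice, so the result has 2 * 2 = 4 rows.
result = pd.merge(left, right, on='key')
print(len(result))  # 4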

Checking for duplicate keys
Contributor

versionadded tag, add a ref tag as well (e.g. _merging.validation::)


right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})

result = pd.merge(left, right, on='B', how='outer', validate="one_to_one");
Contributor

this raises, so better to use a code-block and show what this is going to raise.

@@ -724,6 +724,63 @@ def test_indicator(self):
how='outer', indicator=True)
assert_frame_equal(test5, hand_coded_result)

def test_validation(self):
Contributor

rather than repeating tests, I think pull these out to another test file (test_api) and run them through both merge and merge_asof (parametrize is best)

Contributor Author

The only issue I see is that there are requirements for merge_asof that don't apply to merge (for example, merge_asof requires sorting). I can do this, but it will limit the set of tests I can run to those that work on both (current tests for merge don't work for merge_asof).

Contributor

maybe just do some simple validation then (and leave them where they are). having completely duplicated tests is not that useful.

Contributor Author

ok. Added some simple validations to merge_asof

if self.right_index:
right_unique = not (self.orig_right.index.duplicated()).any()
else:
right_unique = not (self.orig_right[right_key].duplicated()).any()
Contributor

something like

MultiIndex.from_arrays(self.left_join_keys).is_unique is clearer / faster.
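A standalone illustration of that check (the key arrays here are made up; in the PR, self.left_join_keys is already a list of arrays):

import numpy as np
from pandas import MultiIndex

left_join_keys = [np.array([1, 1, 2]), np.array(['a', 'b', 'b'])]

# Unique across the combination of all key columns, without building a DataFrame
left_unique = MultiIndex.from_arrays(left_join_keys).is_unique
print(left_unique)  # True: (1, 'a'), (1, 'b'), (2, 'b') are all distinct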

Contributor Author

Not convinced that's more readable than just having that original pandas code, but I will put it in.

Contributor

@nickeubank the point is you are repeating lots of code for no reason. The notion of figuring out which is what (from the input args) is actually non-trivial and will be refactored shortly. Don't repeat it.

Contributor

see #16228 for the issue

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right dataset.
Contributor

datasets

@@ -551,6 +552,18 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
Contributor

I would split this up and put the choices on the 2nd line, otherwise this gets too long (or you can use \ to split the line). maybe check how this all renders.

In the following example, there are duplicate values of ``B`` in the right DataFrame. As this is not a one-to-one merge -- as specified in the ``validate`` argument -- an exception will be raised.

.. code-block:: python

Contributor

when using code-block, you need to copy-paste exactly from an ipython session (e.g. formatting counts).


If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.

.. ipython:: python
Contributor

better to move this block before the other one and actually run it, then the previous code-block only needs to show the merge operation (as you will have already shown the df when you remove the suppress)

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Contributor

I think you need to indent the 2nd row after the * @jorisvandenbossche ?

Member

indeed

Member

also needs a blank line above this one (to delineate the itemize block)

@@ -341,6 +344,19 @@ def merge_asof(left, right, on=None,

.. versionadded:: 0.20.0

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
Contributor

Same. As an aside, we ought to use a template for these merge* doc-strings.

def _validate(self, validate):

# Check uniqueness of each
if self.left_index:
Contributor

not self.orig_left.index.is_unique

left_unique = MultiIndex.from_arrays(self.left_join_keys
).is_unique

if self.right_index:
Contributor

same

'nay', 'chirp']},
index=range(5))

merge(left, right, left_index=True, right_index=True, validate='1:1')
Contributor

comment on why you are not checking the result values

Contributor Author

Good point. Come to think of it, I actually should test those -- in theory validate has no impact on the output (it only runs checks), but there's no reason not to make that a tested behavior.
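A sketch of the kind of result check being discussed (the frames are illustrative; the point is that validate only adds a check and should not change the output):

from pandas import DataFrame, merge
from pandas.testing import assert_frame_equal

left = DataFrame({'a': ['a', 'b'], 'b': [0, 1]})
right = DataFrame({'a': ['a', 'b'], 'c': [0, 1]})

# The validated merge should produce exactly the same frame as a plain merge.
expected = merge(left, right, on='a')
result = merge(left, right, on='a', validate='1:1')
assert_frame_equal(result, expected)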

Member

@jorisvandenbossche jorisvandenbossche left a comment

I am not sure it makes sense to also have this keyword for merge_asof, as you do not match on exact keys anyway, but on the nearest.
Further, this also gets complicated because e.g. allow_exact_matches can actually influence whether there would be duplicate keys or not.
Also, the explanation would need to be different, as you don't have the same concept of a "one_to_many" merge as with the normal merge: even if you have duplicate keys, you end up with a one_to_one kind of merge.

@@ -551,6 +552,18 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

Needs a blank line above this one (to delineate the itemize)

- ``validate`` : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many", "many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
left and right dataset.
Member

this has to be indented at the same level as "one...

Member

and the same for the two items below

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

indeed

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
If specified, checks if merge is of specified type.
* "one_to_one" or "1:1": check if merge keys are unique in both
Member

also needs a blank line above this one (to delineate the itemize block)

@@ -174,6 +174,18 @@

.. versionadded:: 0.17.0

validate : {None, '1:1', '1:m', 'm:1', 'm:m', "one_to_one", "one_to_many",
"many_to_one", "many_to_many"}, default None
Member

many_to_many is listed as a possibility, but is not explained in the list below

Member

The formatting here (multiple lines for the type specification; to render nicely in the html docs it has to fit on one line) currently does not work nicely with numpydoc (see numpy/numpydoc#87).

Putting just "string, default None" and then listing the options below is an alternative.
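A sketch of that alternative layout for the docstring entry (wording taken from the option list in the PR):

validate : string, default None
    If specified, checks if merge is of specified type.

    * "one_to_one" or "1:1": check if merge keys are unique in both
      left and right datasets.
    * "one_to_many" or "1:m": check if merge keys are unique in left dataset.
    * "many_to_one" or "m:1": check if merge keys are unique in right dataset.
    * "many_to_many" or "m:m": allowed, but does not result in checks.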

Contributor Author

@jorisvandenbossche This relates to the question at the top of this PR: should we check the manys? In other words, if someone passed "one_to_many" but the merge is actually "one_to_one", is that ok?

My sense is that's fine -- I've never been in a situation where getting 1:1 where I expected 1:m was a problem -- but one corollary is that "many_to_many" doesn't actually do anything. I like the idea of leaving it there -- it's nice to be able to label one's merges consistently in this way -- but that specification doesn't imply any checks (which is why I left it off the list).

Member

Yes, I agree with the idea of not checking that a 'many' is in practice a 'one'. So this means that 'many_to_many' should check nothing.

But apart from that, we should either not have it as an option, or else include it in the docstring (and just say that it does nothing).

Member

I see you already did that in the meantime!

).is_unique

# Check valid arg
if validate not in ['one_to_one', '1:1',
Member

can also put this in the end as an else clause, then don't need to list all options here again

raise ValueError("Merge keys are not unique in right dataset;"
" not a one-to-one merge")

if validate in ["one_to_many", "1:m"]:
Member

if -> elif (the same below) (makes it clearer code-wise they are exclusive if statements)
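Putting the two suggestions together (an elif chain plus a trailing else for unrecognised values), the validation logic might look roughly like this; the function below takes the uniqueness flags as arguments purely to keep the sketch self-contained, whereas the PR computes them inside _validate:

def _validate(validate, left_unique, right_unique):
    if validate in ["one_to_one", "1:1"]:
        if not left_unique and not right_unique:
            raise ValueError("Merge keys are not unique in either left "
                             "or right dataset; not a one-to-one merge")
        elif not left_unique:
            raise ValueError("Merge keys are not unique in left dataset; "
                             "not a one-to-one merge")
        elif not right_unique:
            raise ValueError("Merge keys are not unique in right dataset; "
                             "not a one-to-one merge")
    elif validate in ["one_to_many", "1:m"]:
        if not left_unique:
            raise ValueError("Merge keys are not unique in left dataset; "
                             "not a one-to-many merge")
    elif validate in ["many_to_one", "m:1"]:
        if not right_unique:
            raise ValueError("Merge keys are not unique in right dataset; "
                             "not a many-to-one merge")
    elif validate in ["many_to_many", "m:m"]:
        pass  # allowed, but no checks are performed
    else:
        raise ValueError("Not a valid argument for validate")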

# Check invalid arguments
with pytest.raises(ValueError):
merge(left, right, on='a', validate='jibberish')

Member

can you also add tests for merging on multiple columns (and maybe also one test combining e.g. left_index and right_on, but no need to do that for all of them)
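A sketch of such a multi-column test (data and names are illustrative):

import pytest
from pandas import DataFrame, concat, merge

def test_validate_multiple_columns():
    left = DataFrame({'a': ['a', 'a', 'b'], 'b': [0, 1, 0], 'x': [1, 2, 3]})
    right = DataFrame({'a': ['a', 'a', 'b'], 'b': [0, 1, 0], 'y': [4, 5, 6]})

    # the (a, b) key pairs are unique on both sides, so 1:1 validation passes
    merge(left, right, on=['a', 'b'], validate='1:1')

    # duplicating one key pair on the right should make 1:1 validation fail
    right_dup = concat([right, right.iloc[[0]]], ignore_index=True)
    with pytest.raises(ValueError):
        merge(left, right_dup, on=['a', 'b'], validate='1:1')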

Contributor Author

@nickeubank nickeubank left a comment

I'm fine with dropping from merge_asof. @jreback you ok with that?

@jreback
Contributor

jreback commented May 11, 2017

yep

@nickeubank nickeubank force-pushed the check_merge branch 2 times, most recently from 9c24834 to d7de5fa on May 11, 2017 05:33
dataset.
* "many_to_one" or "m:1": checks if merge keys are unique in right
dataset.
* "many_to_may" or "m:m": allowed, but does not result in checks.
Member

may -> many

988 elif not right_unique:
--> 989 raise ValueError("Merge keys are not unique in right dataset;"
990 " not a one-to-one merge")
991
Member

You can truncate this to only the final 'ValueError: .... ' line

``validate`` argument checks merge key uniqueness
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``validate`` argument for :func:`merge` function now checks whether a merge is
Member

for the merge function

and maybe "the optional validate argument" (to make clear that is not the default to check something)

@@ -174,6 +174,19 @@

.. versionadded:: 0.17.0

validate : String, default None
Member

(minor nitpick) String -> string

@nickeubank nickeubank force-pushed the check_merge branch 2 times, most recently from 8b8f3b0 to 0c2ef6f on May 19, 2017 19:57
@nickeubank
Contributor Author

@jorisvandenbossche @jreback Think I covered all suggestions!

Contributor

@jreback jreback left a comment

some doc/style comments

@@ -551,6 +552,21 @@ standard database join operations between DataFrame objects:

.. versionadded:: 0.17.0

- ``validate`` : String, default None
If specified, checks if merge is of specified type.
Contributor

does this render? I think it needs indenting

Contributor Author

[screenshot of the rendered documentation, 2017-05-21 9:09 am]

left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})

.. code-block:: python
Contributor

this needs to look like ipython output (the code block)

Contributor Author

Tweaked. Joris said to suppress most of the traceback, but I added the In [] and Out []. If you need anything else, please specify.
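For reference, the ipython-style formatting being asked for looks roughly like this (prompt numbering and the amount of traceback shown are illustrative; the merge call and error message come from the example above):

In [1]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
...
ValueError: Merge keys are not unique in right dataset; not a one-to-one merge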


ValueError: Merge keys are not unique in right dataset; not a one-to-one merge

If the user is aware of the duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left DataFrame, one can use the `one_to_many` argument instead, which will not raise an exception.
Contributor

use validate='one_to_many'


- The ``validate`` argument for :func:`merge` function now checks whether a merge is
one-to-one, one-to-many, many-to-one, or many-to-many. If a merge is found to not
be an example of specified merge type, an exception will be raised. (:issue:`16270`)
Contributor

add a link to the doc-section

# Check data integrity
if validate in ["one_to_one", "1:1"]:
if not left_unique and not right_unique:
raise ValueError("Merge keys are not unique in either left"
Contributor

we might want to raise a MergeError (inherits from ValueError) instead here (as this is what we do for the other operations). Have a look and see if this is inconsistent

Contributor Author

Converted. I think that's fair.
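A sketch of the suggested swap; MergeError is defined in pandas/core/reshape/merge.py as a subclass of ValueError (exposing it in pandas.errors is discussed further down), and the helper function here is purely illustrative:

class MergeError(ValueError):
    """Raised when a merge cannot be performed as requested."""

def check_one_to_one(left_unique, right_unique):
    # illustrative helper: raise MergeError rather than a bare ValueError
    if not left_unique and not right_unique:
        raise MergeError("Merge keys are not unique in either left "
                         "or right dataset; not a one-to-one merge")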

@jorisvandenbossche jorisvandenbossche changed the title add validate argument to merge ENH: add validate argument to merge May 22, 2017
@jorisvandenbossche jorisvandenbossche merged commit b297983 into pandas-dev:master May 22, 2017
@jorisvandenbossche
Member

@nickeubank Thanks!

@jorisvandenbossche
Member

@nickeubank I just noticed the MergeError comment, which I think you actually did not change? (only in the doc)

So maybe we can do a follow-up PR if we want to change this from a ValueError to a MergeError. In any case, if we use this error, we also have to move it to pandas.errors module for public exposure. And mention it in the docstring that this error is raised.

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone May 22, 2017
@jreback
Contributor

jreback commented May 22, 2017

I agree. let's change ValueError -> MergeError (and then add MergeError to pandas.errors); also may need a doc-string :>

pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
@nickeubank
Contributor Author

@jreback @jorisvandenbossche ugh, sorry -- I updated it but apparently didn't push the commit. Will open new PR.

Labels
Enhancement, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Successfully merging this pull request may close these issues:
ENH: option to check merge is one-to-one, many-to-one, one-to-many, or many-to-many